Code
import pandas as pd
import numpy as np
import os
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as snsIn this notebook, we carry out an in-depth exploratory and descriptive analysis of Real Estate Dataset, a widely used dataset for income prediction tasks based on assesed value,sales ratio ans sales amount attributes.
This phase of analysis is essential for uncovering patterns, detecting potential biases, and gaining intuition about the dataset’s structure before applying any modelling procedures. We examine the distribution of key numerical and categorical variables, investigate relationships between assesed value and sales amount, and use visualizations to summarize insights.
We begin our analysis by importing the core Python libraries required for data handling, numerical computation, visualization, and directory management:
pandas: Enables efficient manipulation, filtering, and aggregation of structured tabular data, forming the backbone of our analysis pipeline.
numpy: Provides support for fast numerical operations, array-based computation, and statistical routines.
os: Facilitates interaction with the file system, allowing us to construct flexible and portable directory paths for data and output management.
plotly.express: A high-level graphing library that enables the creation of interactive, publication-quality visualizations, which we use extensively to uncover patterns and present insights throughout the notebook.
import pandas as pd
import numpy as np
import os
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as snsTo ensure reproducibility and organized storage, we programmatically create directories if they don’t already exist for:
These directories will store intermediate and final outputs for reproducibility.
We load the cleaned version of the UCI Adult Income Dataset from the processed data directory into a Pandas DataFrame. The head(10) function shows the first ten records, giving a glimpse into the data columns such as Town, Address, Property Type, etc.
real_estate_filename = os.path.join(processed_dir, 'real_estate_cleaned.csv')
real_estate = pd.read_csv(real_estate_filename)
real_estate| Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 160091 | 2016 | 12/23/2016 | Avon | 2 EDGEWOOD | 143390 | 224000.0 | 0.640134 | Condo | appartments |
| 1 | 160172 | 2016 | 4/13/2017 | Bethel | 66 H NASHVILLE ROAD | 80500 | 130000.0 | 0.619231 | Condo | appartments |
| 2 | 160258 | 2016 | 6/29/2017 | Bethel | 1 EAGLE ROCK HILL | 117180 | 200000.0 | 0.585900 | Condo | appartments |
| 3 | 160103 | 2016 | 11/30/2016 | Branford | 137 PEDDLARS DR | 106700 | 155000.0 | 0.688387 | Condo | appartments |
| 4 | 167930 | 2016 | 9/27/2017 | Bridgeport | 95 LANCE CIR | 105820 | 148000.0 | 0.715000 | Condo | appartments |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143962 | 190120 | 2019 | 4/28/2020 | Rocky Hill | 174 VALLEY CREST DRIVE | 174650 | 242000.0 | 0.721700 | Single Family | single-residence |
| 143963 | 190460 | 2019 | 8/17/2020 | Southbury | 10 CONCORD CT | 300250 | 412000.0 | 0.728800 | Condo | appartments |
| 143964 | 19200 | 2019 | 6/29/2020 | Monroe | 39 FARM VIEW RD | 238400 | 377400.0 | 0.631700 | Single Family | single-residence |
| 143965 | 190480 | 2019 | 2/25/2020 | New Haven | 498 HOWARD AV | 110740 | 248000.0 | 0.446500 | Two Family | duplex-residence |
| 143966 | 19022 | 2019 | 10/17/2019 | Watertown | 305 BEACH AVE | 192800 | 235000.0 | 0.820400 | Single Family | single-residence |
143967 rows × 10 columns
real_estate_filename = os.path.join(processed_dir, 'real_estate_cleaned.csv')
real_estate = pd.read_csv(real_estate_filename)
real_estateHere, we examine the structure of the dataset:
list year, dates recorded) and categorical variables (e.g., town, property type).Understanding data types and null entries is essential before proceeding with analysis.
summary_df = pd.DataFrame({
'Column': real_estate.columns,
'Data Type': real_estate.dtypes.values,
'Missing Values': real_estate.isnull().sum().values
})
summary_df| Column | Data Type | Missing Values | |
|---|---|---|---|
| 0 | Serial Number | int64 | 0 |
| 1 | List Year | int64 | 0 |
| 2 | Date Recorded | object | 0 |
| 3 | Town | object | 0 |
| 4 | Address | object | 0 |
| 5 | Assessed Value | int64 | 0 |
| 6 | Sale Amount | float64 | 0 |
| 7 | Sales Ratio | float64 | 0 |
| 8 | Property Type | object | 0 |
| 9 | Residential Type | object | 0 |
real_estate.describe()| Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio | |
|---|---|---|---|---|---|
| count | 1.439670e+05 | 143967.000000 | 143967.000000 | 143967.000000 | 143967.000000 |
| mean | 6.270426e+05 | 2017.567123 | 161573.011760 | 247757.685692 | 0.668229 |
| std | 5.434197e+06 | 1.124513 | 76604.109558 | 119804.305879 | 0.134377 |
| min | 1.610000e+02 | 2016.000000 | 1680.000000 | 2000.000000 | 0.298372 |
| 25% | 1.603150e+05 | 2017.000000 | 105980.000000 | 160000.000000 | 0.584300 |
| 50% | 1.709830e+05 | 2018.000000 | 147000.000000 | 228000.000000 | 0.651503 |
| 75% | 1.900580e+05 | 2019.000000 | 206900.000000 | 320000.000000 | 0.734146 |
| max | 1.920000e+08 | 2019.000000 | 380370.000000 | 684000.000000 | 1.078614 |
This summary provides a snapshot of key distribution characteristics. We observe that:
Serial Number is a unique identifier, ranging widely due to the sequential nature of property listings. It is not used for statistical analysis but ensures data integrity.
List Year ranges from 2016 to 2019. This indicates that the dataset covers four years of listing activity, allowing for temporal trend analysis. The values are evenly distributed across these years.
Assessed Value ranges from 1,680 to 380,370. This variable shows some skewness toward higher property values. The average lies below the 75th percentile, indicating that while most properties are modestly valued, a few high-value properties exist and pull the mean upward.
Sale Amount ranges from 2,000 to 684,000 after outlier removal. Like Assessed Value, it is right-skewed, with many properties clustered around typical residential prices and a few much higher-value sales. The median is a better central measure here.
Sales Ratio ranges from ~0.29 to 1.07. A value of 1 represents a perfect match between sale amount and assessed value. Most values lie below 1, indicating that many properties sold for less than their assessed value. This ratio helps in evaluating market valuation accuracy across towns or years.
real_estate.describe(include='object')| Date Recorded | Town | Address | Property Type | Residential Type | |
|---|---|---|---|---|---|
| count | 143967 | 143967 | 143967 | 143967 | 143967 |
| unique | 1167 | 169 | 133891 | 5 | 5 |
| top | 7/1/2019 | Bridgeport | 71 AIKEN ST | Single Family | single-residence |
| freq | 454 | 5806 | 24 | 103968 | 103968 |
This section highlights the composition and variety within the categorical features of the dataset:
Date Recorded spans from 2016 to 2020, capturing the transaction recording date. This field is useful for tracking seasonal or year-specific market behavior. The format is string-based, and cleaning may be required before time series analysis.
Town includes 169 unique town names. This variable is crucial for geographic segmentation. Some towns have a significantly higher number of property sales, which could be indicative of population density or economic activity.
Address is a string-based location descriptor. While useful for identifying specific properties, it often includes inconsistent formatting or missing values (e.g., ‘nan’), making it less reliable for grouping without cleaning.
Property Type is consistent across the dataset with a single value: Residential. This simplifies analysis, as there’s no need to disaggregate by different property purposes like commercial or industrial.
Residential Type includes five subtypes: Condo, Single Family, Two Family, Three Family, and Four Family. These categories reflect structural distinctions and can heavily influence sale amounts and assessment values. For example, single-family homes may generally command higher per-unit values than multi-family properties.
# Create the histogram
fig = px.histogram(
real_estate,
x='Sale Amount',
nbins=30,
title='Distribution of Sale Amounts',
labels={'Sale Amount': 'Sale Amount'},
color_discrete_sequence=['skyblue']
)
# Update layout for better visuals
fig.update_layout(
xaxis_title='Sale Amount',
yaxis_title='Number of Properties',
bargap=0.1,
template='plotly_white'
)
# Make responsive and show hover/click info
fig.show(config={"responsive": True})
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sale_Amounts.jpg'))
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sale_Amounts.png'))The histogram above illustrates the distribution of property sale amounts across the dataset (filtered to exclude extreme outliers above $1M) ’s a detailed breakdown of what the plot reveals:
Skewness: The distribution is clearly right-skewed — a common pattern in real estate. Most properties are sold for relatively modest amounts, with fewer sales occurring at very high prices.
Mode (Most Frequent Range): The tallest bars are centered around $150,000 to $200,000, where we see the peak number of transactions (over 12,000). This suggests the majority of homes fall within this pricing range.
Density Curve (KDE): The smooth blue curve overlays the bars and visually reinforces the distribution shape. The sharp rise and gradual taper highlight the concentration of mid-priced properties, followed by a long tail extending toward higher prices.
Spread: Sale amounts range broadly from near $0 up to around $700,000, with fewer properties selling at the extremes. The lower end includes possible undervalued or distressed sales, while the upper end includes premium properties.
Implications: This right skew implies that using the median sale price is more appropriate than the mean, which may be inflated by a small number of expensive sales. The long tail also suggests log transformation might be useful if this variable is used in predictive modeling.
# Count the number of properties per Residential Type
res_type_counts = real_estate['Residential Type'].value_counts().reset_index()
res_type_counts.columns = ['Residential Type', 'Count']
# Create bar chart
fig = px.bar(
res_type_counts,
x='Residential Type',
y='Count',
title='Count of Each Residential Type',
color_discrete_sequence=['#1f77b4']
)
fig.update_layout(
xaxis_title='Residential Type',
yaxis_title='Number of Properties',
xaxis_tickangle=30,
template='plotly_white'
)
# Show interactive chart
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'Count_of_Each_Residential_Type.jpg'))
fig.write_image(os.path.join(results_dir, 'Count_of_Each_Residential_Type.png'))The bar chart above displays the distribution of properties by Residential Type, giving us a categorical breakdown of the housing structures in the dataset.
Dominance of Single Family Homes: Single Family residences are the most common type in the dataset, significantly outnumbering all other categories. This reflects the typical composition of residential areas, where detached homes are the standard property form.
Prevalence of Condos and Two Family Units: Condo units and Two Family homes are also quite prominent, suggesting a good mix of individual and shared housing structures. Condos may be more concentrated in urban centers, while Two Family homes could reflect multi-generational housing or rental opportunities.
Smaller Share of Three and Four Family Homes: Three Family and especially Four Family properties are much less common. These types likely represent either older constructions, niche rental properties, or homes in specific zoning areas that allow multi-unit dwellings.
Insights for Market Segmentation: Understanding this breakdown is valuable for market analysis and pricing comparisons. For example, pricing strategies, maintenance costs, and investment appeal can differ greatly between Single Family and multi-family properties.
# Ensure numeric conversion
real_estate['Sales Ratio'] = pd.to_numeric(real_estate['Sales Ratio'], errors='coerce')
# Replace inf and -inf with NaN
real_estate = real_estate.replace([np.inf, -np.inf], np.nan)
# Drop rows with NaN in Sales Ratio
real_estate = real_estate.dropna(subset=['Sales Ratio'])# Create histogram of Sales Ratio
fig = px.histogram(
real_estate,
x='Sales Ratio',
nbins=30,
title='Distribution of Sales Ratio',
labels={'Sales Ratio': 'Sales Ratio'},
color_discrete_sequence=['skyblue']
)
fig.update_layout(
xaxis_title='Sales Ratio',
yaxis_title='Number of Properties',
bargap=0.1,
template='plotly_white'
)
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sales_Ratio.jpg'))
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sales_Ratio.png'))This histogram displays the distribution of the Sales Ratio, which compares a property’s sale price to its assessed value (i.e., Sale Amount ÷ Assessed Value). Here’s what the chart reveals:
Peak & Central Tendency: The distribution peaks around a Sales Ratio of ~0.65, indicating that most properties were sold for about 65% of their assessed value. This suggests a tendency toward undervaluation or conservative government assessments.
Distribution Shape: The plot is slightly right-skewed, but generally resembles a bell-shaped (normal-like) curve. The bulk of the properties lie between 0.5 and 0.8, forming a concentrated central cluster.
Lower and Upper Bounds:
Market Insight: The fact that the vast majority of Sales Ratios are below 1.0 implies that properties in this dataset typically sold for less than their assessed value, which could reflect:
Implication for Valuation Models: The Sales Ratio provides valuable insight into how closely government-assessed values align with actual market behavior.
A Sales Ratio below 1.0 (which is the majority) suggests that most properties were sold for less than their assessed value, potentially indicating that assessments are overestimated.
A Sales Ratio close to or above 1.0 suggests properties that sold at or above their assessed values—this may indicate strong demand or accurate valuation.
# Get top 20 towns by count
top_towns = real_estate['Town'].value_counts().nlargest(20).reset_index()
top_towns.columns = ['Town', 'Count']
# Create bar chart
fig = px.bar(
top_towns,
x='Town',
y='Count',
title='Top 20 Towns by Number of Sales',
color_discrete_sequence=['skyblue']
)
fig.update_layout(
xaxis_title='Town',
yaxis_title='Number of Sales',
xaxis_tickangle=45,
template='plotly_white'
)
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'Top_20_Towns_by_Number_of_Sales.jpg'))
fig.write_image(os.path.join(results_dir, 'Top_20_Towns_by_Number_of_Sales.png'))This bar chart displays the 20 towns with the highest number of real estate transactions. Each bar reflects the volume of property sales in a given town.
Key Observations: Bridgeport dominates the chart with nearly 6,000 sales, significantly ahead of the second-place town. This sharp lead indicates a highly active real estate market in Bridgeport.
Hamden, Waterbury, and Stamford follow with over 3,000 sales each, showing strong but more moderate activity.
From Meriden onward, the number of sales gradually decreases, but all towns shown still reflect high engagement in the market compared to the full list of towns in the dataset.
The bars are relatively evenly spaced among the rest, suggesting a moderately competitive real estate environment in those areas.
Implications: Bridgeport’s high sales volume might be attributed to factors like population size, housing turnover, or market affordability. It could be a prime candidate for targeted investment or policy focus.
Towns like Hamden, Waterbury, and Stamford may represent stable and active secondary markets, appealing to both buyers and sellers.
Local authorities and developers might consider prioritizing infrastructure, zoning, or community development efforts in the top towns to support continued growth and meet demand.
# Select top 5 towns by number of sales
#| echo: false
#| output: false
#| warning: false
top_towns = real_estate['Town'].value_counts().nlargest(5).index
avg_sales_ratio_top = avg_sales_ratio[avg_sales_ratio['Town'].isin(top_towns)]# Calculate average sales ratio by Town and Residential Type
avg_sales_ratio = real_estate.groupby(['Town', 'Residential Type'])['Sales Ratio'].mean().reset_index()
# Select top 5 towns by number of sales
top_towns = real_estate['Town'].value_counts().nlargest(5).index
avg_sales_ratio_top = avg_sales_ratio[avg_sales_ratio['Town'].isin(top_towns)]
# Create grouped bar chart
fig = px.bar(
avg_sales_ratio_top,
x='Town',
y='Sales Ratio',
color='Residential Type',
barmode='group',
title='Average Sales Ratio per Town by Residential Type (Top 5 Towns)',
color_discrete_sequence=px.colors.sequential.Blues
)
fig.update_layout(
xaxis_title='Town',
yaxis_title='Average Sales Ratio',
xaxis_tickangle=45,
template='plotly_white',
legend_title_text='Residential Type'
)
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'income_distribution_by_agegroup_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'income_distribution_by_agegroup_bar_plot.png'))6.5.1 Interpretation: Average Sales Ratio per Town by Residential Type
This grouped bar chart shows the average sales ratio for different residential types across the top 5 towns: Bridgeport, Hamden, Meriden, Stamford, and Waterbury.
- Across all towns, single-residence and triplex-residence types tend to exhibit higher average sales ratios, often exceeding 0.65.
- Quadplex-residences generally show the lowest average sales ratios, especially in Stamford and Waterbury.
- Hamden and Meriden stand out for having consistently higher sales ratios, suggesting stronger or more stable housing demand.
This plot reveals that both town and residential type play a role in market performance — an important insight for investors, analysts, and policymakers.
# Plot
# Group and average
avg_sale_by_year = real_estate.groupby('List Year', as_index=False)['Sale Amount'].mean()
# Create line chart with markers
fig = px.line(
avg_sale_by_year,
x='List Year',
y='Sale Amount',
title='Average Sale Amount by List Year',
markers=True,
line_shape='linear',
color_discrete_sequence=['teal']
)
fig.update_layout(
xaxis_title='List Year',
yaxis_title='Average Sale Amount',
xaxis=dict(tickmode='linear', tickvals=avg_sale_by_year['List Year']),
template='plotly_white'
)
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'Average_Sale_Amount_by_List_Year.jpg'))
fig.write_image(os.path.join(results_dir, 'Average_Sale_Amount_by_List_Year.png'))6.6.1 Interpretation: Actual Sale Amounts Over Time
This line chart illustrates the actual sale amounts for properties across the time range.
- From 2016 to 2018, sales activity was relatively lower, possibly indicating fewer transactions or lower market demand.
- A significant rise in sales amounts begins in 2018 and continues into 2019, closely aligning with the assessed value increase from the same period.
- The post-2018 surge suggests a market shift where real buyer activity and transaction values started matching the earlier assessment growth.
This chart highlights how market realization (sales) tends to follow assessment trends, reflecting delayed responses in actual property transactions.
# Plot
# Group and average
avg_sale_by_year = real_estate.groupby('List Year', as_index=False)['Assessed Value'].mean()
# Create line chart
fig = px.line(
avg_sale_by_year,
x='List Year',
y='Assessed Value',
title='Average Assessed Value by List Year',
markers=True,
line_shape='linear',
color_discrete_sequence=['teal']
)
fig.update_layout(
xaxis_title='List Year',
yaxis_title='Average Assessed Value',
xaxis=dict(tickmode='linear', tickvals=avg_sale_by_year['List Year']),
template='plotly_white'
)
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'Average_Assessed_Value_by_List_Year.jpg'))
fig.write_image(os.path.join(results_dir, 'Average_Assessed_Value_by_List_Year.png'))6.7.1 Interpretation: Assessed Property Values Over Time
This line chart displays how the assessed property values changed over the analyzed period.
- Between 2016 and 2018, the assessed values were relatively higher, indicating consistent or anticipated market value during that period.
- From 2018 to 2019, there is a noticeable upward boost, showing a sharp increase in assessment values — possibly due to reevaluation of properties or broader economic factors.
- This trend suggests that the market’s perceived value of properties increased before actual sales amounts caught up.
Overall, assessed values may reflect official estimates or expectations in the housing market, which can precede actual pricing behavior.
# Plot pie chart
# Prepare data
res_cat_counts = real_estate['Residential Type'].value_counts().reset_index()
res_cat_counts.columns = ['Residential Type', 'Count']
# Create pie chart
fig = px.pie(
res_cat_counts,
names='Residential Type',
values='Count',
title='Residential Type Distribution',
color_discrete_sequence=px.colors.sequential.Blues,
hole=0
)
# Show values inside slices
fig.update_traces(
textinfo='percent+label',
textposition='inside',
insidetextorientation='radial'
)
# Update layout: increase figure size
fig.update_layout(
template='plotly_white',
width=800,
height=600
)
fig.show(config={"responsive": True})
# Save images
fig.write_image(os.path.join(results_dir, 'Residential_Type_Distribution.jpg'))
fig.write_image(os.path.join(results_dir, 'Residential_Type_Distribution.png'))6.8.1 Interpretation: Residential Type Distribution
This pie chart illustrates the distribution of property types in the dataset.
- Single-residence properties dominate the dataset, making up 72.2% of all records — indicating a clear market focus.
- Appartments come next at 20.6%, followed by duplex-residences (4.7%), with triplex and quadplex residences making up the smallest shares (2.1% and 0.4%, respectively).
The visualization reveals that multi-family housing (duplex, triplex, quadplex) is relatively rare compared to individual residential units, which may reflect development patterns or buyer preferences.